Skip to content

mgmtd: remove bogus "hedge" code which corrupted active candidate DS#18601

Merged
mjstapp merged 2 commits intoFRRouting:masterfrom
LabNConsulting:chopps/mgmtd-candidate-overwrite
Apr 9, 2025
Merged

mgmtd: remove bogus "hedge" code which corrupted active candidate DS#18601
mjstapp merged 2 commits intoFRRouting:masterfrom
LabNConsulting:chopps/mgmtd-candidate-overwrite

Conversation

@choppsv1
Copy link
Contributor

@choppsv1 choppsv1 commented Apr 8, 2025

Say you have 2 mgmtd frontend sessions (2 vtysh's) the first one is long
running and is actively changing the global candidate datastore (DS),
the second one starts and exits, this code would then copy running
back over the candidate, blowing away any changes made by the first
session.

(the long running session could technically be any user)

Instead we need to trust the various cleanup code that already exits.
For example in the commit_cfg_reply on success candidate is copied to
running, and on failure for implicit commit running is copied back to
candidate clearing the change. This leaves the non-implicit
configuration changes in this case we actually want candidate to keep
it's changes in transactional cases, in the other case of pending commit
during a file read the code restores candidate (if needed) on exit from
"config terminal", with this call stack:

 vty_config_node_exit()
   nb_cli_pending_commit_check()
     nb_cli_classic_commit()
       nb_candidate_commit_prepare() [fail] -> copy running -> candidate
       nb_candidate_commit_apply() -> copy candidate -> running

fixes #18541

Say you have 2 mgmtd frontend sessions (2 vtysh's) the first one is long
running and is actively changing the global candidate datastore (DS),
the second one starts and exits, this code would then copy running
back over the candidate, blowing away any changes made by the first
session.

(the long running session could technically be any user)

Instead we need to trust the various cleanup code that already exits.
For example in the commit_cfg_reply on success candidate is copied to
running, and on failure *for implicit commit* running is copied back to
candidate clearing the change. This leaves the non-implicit
configuration changes in this case we actually want candidate to keep
it's changes in transactional cases, in the other case of pending commit
during a file read the code restores candidate (if needed) on exit from
"config terminal", with this call stack:

 vty_config_node_exit()
   nb_cli_pending_commit_check()
     nb_cli_classic_commit()
       nb_candidate_commit_prepare() [fail] -> copy running -> candidate
       nb_candidate_commit_apply() -> copy candidate -> running

fixes FRRouting#18541

Signed-off-by: Christian Hopps <[email protected]>
@frrbot frrbot bot added the mgmt FRR Management Infra label Apr 8, 2025
@mjstapp
Copy link
Contributor

mjstapp commented Apr 8, 2025

I'm ... confused about why we should support multiple simultaneous config sessions. How does that work - we don't have ... sequence numbers or uuids or a way of detecting conflicts, do we? Could multiple sessions introduce inconsistencies that would lead to invalid config - could one session do some validation, and then the second introduce a change that invalidates that?

@choppsv1
Copy link
Contributor Author

choppsv1 commented Apr 8, 2025

@mjstapp Our normal CLI uses "implicit-commit" mode which means each change takes affect immediately, a lock is obtained immediate prior to making the change and released immediately after. For file loads locks are obtain prior to reading the file and released after reading the file is done.

This isn't really the point though. The code being remove is blowing away the candidate config w/o obtaining any lock, and without even entering into a config state (i.e., simply doing a vtysh -c 'show foo' is enough to wipe-out another session's in progress changes).

The code is ill-considered "just in case" "cleanup" code that is blatantly wrong.

Copy link
Contributor

@mjstapp mjstapp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we agreed in this week's dev meeting that the __ (double-underscore) functions should be renamed; I'll push this afterwards.

Having just completed a code audit during RCA, the fact that we have 2
different argument orders for the related datastore copying functions
was unnecessary and super confusing.

Fix this code-maintenance/comprehension mistake and move the newer mgmtd
copy routines to use the same arg order as the pre-existing underlying
northbound copy functions (i.e., use `copy(dst, src)`)

Signed-off-by: Christian Hopps <[email protected]>
@choppsv1 choppsv1 force-pushed the chopps/mgmtd-candidate-overwrite branch from 8eb69b5 to 59d2368 Compare April 9, 2025 10:15
@github-actions github-actions bot added the rebase PR needs rebase label Apr 9, 2025
@mjstapp mjstapp merged commit 2aa6e78 into FRRouting:master Apr 9, 2025
13 checks passed
@choppsv1 choppsv1 deleted the chopps/mgmtd-candidate-overwrite branch April 10, 2025 05:10
sudhanshukumar22 added a commit to sudhanshukumar22/sonic-buildimage that referenced this pull request Jun 3, 2025
yxieca pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 17, 2025
Ported the fix FRRouting/frr#18601 from FRR community

Tested using static routes, save and reboot.
K>* 0.0.0.0/0 [0/0] via 10.59.128.1, eth0, 00:03:58
L * 1.1.1.1/32 is directly connected, Loopback0, 00:03:52
C>* 1.1.1.1/32 is directly connected, Loopback0, 00:03:52
C>* 10.59.128.0/20 is directly connected, eth0, 00:03:58
L> 10.59.142.159/32 is directly connected, eth0, 00:03:58
C>* 11.1.0.0/16 is directly connected, vsh0, 00:03:58
L> 11.1.22.238/32 is directly connected, vsh0, 00:03:58
C>* 20.1.1.0/24 is directly connected, Ethernet0, 00:02:22
L> 20.1.1.2/32 is directly connected, Ethernet0, 00:02:22
S>* 40.1.1.1/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:16
S>* 40.1.1.2/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:02
mssonicbld added a commit to mssonicbld/sonic-buildimage that referenced this pull request Jun 18, 2025
Ported the fix FRRouting/frr#18601 from FRR community

Tested using static routes, save and reboot.
K>* 0.0.0.0/0 [0/0] via 10.59.128.1, eth0, 00:03:58
L * 1.1.1.1/32 is directly connected, Loopback0, 00:03:52
C>* 1.1.1.1/32 is directly connected, Loopback0, 00:03:52
C>* 10.59.128.0/20 is directly connected, eth0, 00:03:58
L>  10.59.142.159/32 is directly connected, eth0, 00:03:58
C>* 11.1.0.0/16 is directly connected, vsh0, 00:03:58
L>  11.1.22.238/32 is directly connected, vsh0, 00:03:58
C>* 20.1.1.0/24 is directly connected, Ethernet0, 00:02:22
L>  20.1.1.2/32 is directly connected, Ethernet0, 00:02:22
S>* 40.1.1.1/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:16
S>* 40.1.1.2/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:02
<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

#### Tested branch (Please provide the tested image version)
FRR 10.0.1
mssonicbld added a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 21, 2025
Ported the fix FRRouting/frr#18601 from FRR community

Tested using static routes, save and reboot.
K>* 0.0.0.0/0 [0/0] via 10.59.128.1, eth0, 00:03:58
L failure_prs.log 1.1.1.1/32 is directly connected, Loopback0, 00:03:52
C>* 1.1.1.1/32 is directly connected, Loopback0, 00:03:52
C>* 10.59.128.0/20 is directly connected, eth0, 00:03:58
L> 10.59.142.159/32 is directly connected, eth0, 00:03:58
C>* 11.1.0.0/16 is directly connected, vsh0, 00:03:58
L> 11.1.22.238/32 is directly connected, vsh0, 00:03:58
C>* 20.1.1.0/24 is directly connected, Ethernet0, 00:02:22
L> 20.1.1.2/32 is directly connected, Ethernet0, 00:02:22
S>* 40.1.1.1/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:16
S>* 40.1.1.2/32 [1/0] is directly connected, Ethernet0, weight 1, 00:00:02
<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

#### Tested branch (Please provide the tested image version)
FRR 10.0.1
r12f pushed a commit to r12f/sonic-buildimage-msft that referenced this pull request Sep 27, 2025
r12f added a commit to Azure/sonic-buildimage-msft that referenced this pull request Oct 1, 2025
…ity (#1675)

Sync PR #22408 from sonic-net/sonic-buildimage to 202412.
Original PR: sonic-net/sonic-buildimage#22408

Co-authored-by: sudhanshukumar22 <[email protected]>
r12f added a commit to Azure/sonic-buildimage-msft that referenced this pull request Oct 26, 2025
…ity (#1675)

Sync PR #22408 from sonic-net/sonic-buildimage to 202412.
Original PR: sonic-net/sonic-buildimage#22408

Co-authored-by: sudhanshukumar22 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

master mgmt FRR Management Infra rebase PR needs rebase size/M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[10.0.1] multi sessions corruption with mgmtd during bootup stage.

2 participants